ADBDEV-6641: Rework Fix gprestore/gpbackup hanging in case the helper goes down #113
I restored the contexts so that the case where the helper dies with signal 9 or 11 is handled correctly.
869d865 to 0c9f756
It looks like, if I use the previously attached 6-segment backup with the injected error on a 1-segment cluster with
it hangs for the 300-second timeout on the 2nd and 3rd tables and ultimately does not restore them.
This is not a problem with this patch, because it also reproduces without it.
Rework Fix gprestore/gpbackup hanging in case the helper goes down
Commit 8060b4e starts a goroutine in gpbackup/gprestore that polls every 5
seconds to check whether the helper has failed and cancels pending COPY
commands via the execution context. I think executing commands via ssh this
often is a big overhead, especially on large clusters.
In case of a fatal error on the helper, gprestore/gpbackup could hang forever.
gprestore hung because the COPY command was expecting data from a pipe file
(via 'cat <pipe_file>') which was deleted in the helper's DoCleanup function
before any data was put into the pipe by the restore helper, when the restore
helper exited due to some error. gpbackup hung for similar reasons - the backup
helper exited before it opened the pipe for reading.
The new solution is quite simple: on exit, if it did not receive a SIGPIPE, the
helper opens and closes the pipes that the COPY command is waiting for, which
causes the COPY command to exit as well. The helpers no longer need pipe lists,
since an error in a helper can occur before those lists are filled.
The test has been modified to check more relevant results.